{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing new datasets\n",
    "\n",
    "TopMost can preprocess datasets for topic modeling in a standard way.\n",
    "A dataset must include two files: `train.jsonlist` and `test.jsonlist`. Each contains a list of json, like\n",
    "\n",
    "```json\n",
    "{\"label\": \"rec.autos\", \"text\": \"WHAT car is this!?...\"}\n",
    "{\"label\": \"comp.sys.mac.hardware\", \"text\": \"A fair number of brave souls who upgraded their...\"}\n",
    "```\n",
    "\n",
    "`label` is optional.\n",
    "\n",
    "**Here we download and preprocess 20newsgroup as follows.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": []
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'20ng_all': ['talk.religion.misc', 'comp.windows.x', 'rec.sport.baseball', 'talk.politics.mideast', 'comp.sys.mac.hardware', 'sci.space', 'talk.politics.guns', 'comp.graphics', 'comp.os.ms-windows.misc', 'soc.religion.christian', 'talk.politics.misc', 'rec.motorcycles', 'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'misc.forsale', 'sci.crypt', 'rec.autos', 'sci.med', 'sci.electronics', 'alt.atheism']}\n",
      "name:  20ng_all\n",
      "categories:  ['talk.religion.misc', 'comp.windows.x', 'rec.sport.baseball', 'talk.politics.mideast', 'comp.sys.mac.hardware', 'sci.space', 'talk.politics.guns', 'comp.graphics', 'comp.os.ms-windows.misc', 'soc.religion.christian', 'talk.politics.misc', 'rec.motorcycles', 'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'misc.forsale', 'sci.crypt', 'rec.autos', 'sci.med', 'sci.electronics', 'alt.atheism']\n",
      "subset:  train\n",
      "Downloading articles\n",
      "data size:  11314\n",
      "Saving to ./datasets/20NG\n",
      "name:  20ng_all\n",
      "categories:  ['talk.religion.misc', 'comp.windows.x', 'rec.sport.baseball', 'talk.politics.mideast', 'comp.sys.mac.hardware', 'sci.space', 'talk.politics.guns', 'comp.graphics', 'comp.os.ms-windows.misc', 'soc.religion.christian', 'talk.politics.misc', 'rec.motorcycles', 'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'misc.forsale', 'sci.crypt', 'rec.autos', 'sci.med', 'sci.electronics', 'alt.atheism']\n",
      "subset:  test\n",
      "Downloading articles\n",
      "data size:  7532\n",
      "Saving to ./datasets/20NG\n",
      "name:  20ng_all\n",
      "categories:  ['talk.religion.misc', 'comp.windows.x', 'rec.sport.baseball', 'talk.politics.mideast', 'comp.sys.mac.hardware', 'sci.space', 'talk.politics.guns', 'comp.graphics', 'comp.os.ms-windows.misc', 'soc.religion.christian', 'talk.politics.misc', 'rec.motorcycles', 'comp.sys.ibm.pc.hardware', 'rec.sport.hockey', 'misc.forsale', 'sci.crypt', 'rec.autos', 'sci.med', 'sci.electronics', 'alt.atheism']\n",
      "subset:  all\n",
      "Downloading articles\n",
      "data size:  18846\n",
      "Saving to ./datasets/20NG\n"
     ]
    }
   ],
   "source": [
    "from topmost import download_20ng, Preprocess\n",
    "\n",
    "# download raw data\n",
    "download_20ng.download_save(output_dir='./datasets/20NG')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"group\": \"rec.autos\", \"text\": \"From: lerxst@wam.umd.edu (where's my thing)\\nSubject: WHAT car is this!?\\nNntp-Posting-Host: rac3.wam.umd.edu\\nOrganization: University of Maryland, College Park\\nLines: 15\\n\\n I was wondering if anyone out there could enlighten me on this car I saw\\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\\nthe front bumper was separate from the rest of the body. This is \\nall I know. If anyone can tellme a model name, engine specs, years\\nof production, where this car is made, history, or whatever info you\\nhave on this funky looking car, please e-mail.\\n\\nThanks,\\n- IL\\n   ---- brought to you by your neighborhood Lerxst ----\\n\\n\\n\\n\\n\"}\n",
      "{\"group\": \"comp.sys.mac.hardware\", \"text\": \"From: guykuo@carson.u.washington.edu (Guy Kuo)\\nSubject: SI Clock Poll - Final Call\\nSummary: Final call for SI clock reports\\nKeywords: SI,acceleration,clock,upgrade\\nArticle-I.D.: shelley.1qvfo9INNc3s\\nOrganization: University of Washington\\nLines: 11\\nNNTP-Posting-Host: carson.u.washington.edu\\n\\nA fair number of brave souls who upgraded their SI clock oscillator have\\nshared their experiences for this poll. Please send a brief message detailing\\nyour experiences with the procedure. Top speed attained, CPU rated speed,\\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\\nfunctionality with 800 and 1.4 m floppies are especially requested.\\n\\nI will be summarizing in the next two days, so please add to the network\\nknowledge base if you have done the clock upgrade and haven't answered this\\npoll. Thanks.\\n\\nGuy Kuo <guykuo@u.washington.edu>\\n\"}\n"
     ]
    }
   ],
   "source": [
    "! head -2 ./datasets/20NG/train.jsonlist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://raw.githubusercontent.com/BobXWu/TopMost/master/data/stopwords.zip\n",
      "Using downloaded and verified file: ./datasets/stopwords.zip\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-06-19 22:40:37,045 - TopMost - Found training documents 11314 testing documents 7532\n",
      "2024-06-19 22:40:37,050 - TopMost - label2id: {'alt.atheism': 0, 'comp.graphics': 1, 'comp.os.ms-windows.misc': 2, 'comp.sys.ibm.pc.hardware': 3, 'comp.sys.mac.hardware': 4, 'comp.windows.x': 5, 'misc.forsale': 6, 'rec.autos': 7, 'rec.motorcycles': 8, 'rec.sport.baseball': 9, 'rec.sport.hockey': 10, 'sci.crypt': 11, 'sci.electronics': 12, 'sci.med': 13, 'sci.space': 14, 'soc.religion.christian': 15, 'talk.politics.guns': 16, 'talk.politics.mideast': 17, 'talk.politics.misc': 18, 'talk.religion.misc': 19}\n",
      "loading train texts: 100%|██████████| 11314/11314 [00:03<00:00, 3730.48it/s]\n",
      "loading test texts: 100%|██████████| 7532/7532 [00:01<00:00, 3993.64it/s]\n",
      "parsing texts: 100%|██████████| 11314/11314 [00:01<00:00, 9002.80it/s]\n",
      "2024-06-19 22:40:45,345 - TopMost - Real vocab size: 5000\n",
      "2024-06-19 22:40:45,368 - TopMost - Real training size: 11314 \t avg length: 110.543\n",
      "parsing texts: 100%|██████████| 7532/7532 [00:00<00:00, 9357.91it/s]\n",
      "2024-06-19 22:40:47,296 - TopMost - Real testing size: 7532 \t avg length: 106.663\n",
      "loading word embeddings: 100%|██████████| 5000/5000 [00:00<00:00, 8260.58it/s]\n",
      "2024-06-19 22:41:13,922 - TopMost - number of found embeddings: 4957/5000\n"
     ]
    }
   ],
   "source": [
    "# preprocess raw data\n",
    "preprocess = Preprocess(vocab_size=5000, min_term=1, stopwords='snowball')\n",
    "\n",
    "rst = preprocess.preprocess_jsonlist(dataset_dir='./datasets/20NG', label_name=\"group\")\n",
    "\n",
    "preprocess.save('./datasets/20NG', **rst)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "torch1.13py3.8",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
